Method and system for performing parallel integer multiply accumulate operations on packed data

ABSTRACT

A multiply accumulate unit (“MAC”) that performs operations on packed integer data. In one embodiment, the MAC receives two 32-bit data words which, depending on the specified mode of operation, each contain either four 8-bit operands, two 16-bit operands, or one 32-bit operand. Depending on the mode of operation, the MAC performs either sixteen 8×8 operations, four 16×16 operations, or one 32×32 operation. Results may be individually retrieved from registers and the corresponding accumulator cleared after the read cycle. In addition, the accumulators may be globally initialized. Two results from the 8×8 operations may be packed into a single 32-bit register. The MAC may also shift and saturate the products as required.

CROSS REFERENCE TO RELATED APPLICATION

This application claims the benefit of provisional United States Patent Application entitled “Digital Signal Coprocessor,” application No. 60/492,060, filed on Jul. 31, 2003.

FIELD OF THE INVENTION

This invention relates to a multiply-accumulate unit of a processor, particularly a multiply-accumulate unit which can perform parallel integer multiply accumulate operations on packed data.

BACKGROUND ART

Multiply-accumulate units (“MACs”) perform multiplication and accumulation operations in a single instruction cycle in a processor. Usually, the result of a multiplication operation is added, or accumulated to, another result stored in an accumulator, or register. These units are often used to speed up video/graphics applications as well as digital signal processor operations such as convolution and filtering.

Single instruction, multiple data (“SIMD”) style processing has been used to accelerate multimedia processing. Instruction sets for processors often include SIMD instructions where multiple data elements are packed in a single wide register, with the individual data elements operated on in parallel. One example is Intel's MMX™ (multimedia extension) instruction set. This parallel operation on data elements accelerates processing.

As noted above, MAC operations are used to accelerate various applications. In addition to speed, it would be desirable to have an architecture that is capable of handling multiply and accumulate operations for different-sized operands as required by the instruction (i.e., 8×8 operations, 16×16 operations, etc.). It would also be desirable to be able to retrieve individual results of MAC operations and clear the corresponding accumulator. In addition, it would be advantageous to have a MAC unit which could provide the cross-product of operands, pack results into one register, and shift results where desired.

SUMMARY OF THE INVENTION

These goals have been met by a MAC that performs multiply accumulate operations on packed integer data. In one embodiment, the MAC receives two 32-bit data words which, depending on the specified mode of operation, each contain either four 8-bit operands, two 16-bit operands, or one 32-bit operand. Depending on the mode of operation, the MAC performs either sixteen 8×8 operations, four 16×16 operations, or one 32×32 operation. Results may be individually retrieved from registers and the corresponding accumulator cleared after the read cycle. In addition, the accumulators may be globally initialized. Two results from the 8×8 operations may be packed into a single 32-bit register. The MAC may also shift and saturate the products as required.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of the multiply accumulate unit (“MAC”) of the invention.

FIG. 2 is a block diagram of a processor status word used with the apparatus of FIG. 1.

FIG. 3 is a chart of modes of operation and resulting operands, number of operations per cycle, and obtained results for the apparatus of FIG. 1.

FIG. 4 is a block diagram of data words used as input in the apparatus of FIG. 1.

FIG. 5 is a block diagram of a 16×16 multiplier block in the apparatus of FIG. 1.

FIG. 6 is a block diagram of a saturation circuit in the apparatus of FIG. 1.

FIG. 7 a is a block diagram of a shift and saturate circuit in the apparatus of FIG. 1.

FIG. 7 b is a block diagram of an alternative shift and saturate circuit in the apparatus of FIG. 1.

FIG. 8 is a block diagram of a combined saturation circuit in the apparatus of FIG. 1.

DETAILED DESCRIPTION OF THE INVENTION

In one embodiment of the invention, the MAC is part of a digital signal engine (“DSE”) coprocessor. In FIG. 1, a conceptual block diagram of the MAC unit 10 features sixteen 8×8 multipliers 12, each with a corresponding adder 14, accumulator 18, and accumulator register 22. In this embodiment, the adder 14 is a 20-bit adder, the accumulator 18 is a 20-bit accumulator, and the register 22 is a 20-bit register. A preclear multiplexer 20 is coupled to the adder 14 and is used to initialize the accumulators 18. A postclear multiplexer 18 is also coupled to the adder 14 and is used to clear any accumulator register 22 that has been accessed in order to retrieve the result of MAC operations. The preclear 20 and postclear 28 multiplexers are set by inputs 28, 30 received by the MAC unit 10. In addition, the unit 10 receives input (for instance, in a processing instruction) indicating whether the accumulator product should be saturated (SA 34) and/or whether the product should be shifted and saturated (SSP 32). The unit 10 is able to send overflow bits 24 to other registers in the processor, for instance hardware registers.

A DSE processor status word (“PSW”) register controls processor operation in one embodiment of the invention. In FIG. 2, the PSW 122 is 32 bits long and includes the DSE program counter 124, which holds the address of the next DSE instruction to be executed. For purposes of the invention, the other bits of interest include bits 26 and 27, MACM0 128 and MACM1 130, which indicate the mode in which the MAC operates:

Bit 27   Bit 26   Mode
0        0        Default mode
0        1        8-bit packed mode (8×8 mode)
1        0        16-bit packed mode (16×16 mode)
1        1        32-bit mode (32×32 mode)

Bit 28, the SA bit 132, indicates whether the accumulator value should be saturated (i.e., if the bit is set to “1,” the value is saturated). Bit 29, the SSP bit 134, indicates whether the product should be shifted and saturated (i.e., if the bit is set to “1,” the product is shifted and saturated). The remaining bits 136 are used to control processor operation. The use of the PSW and the assignment of bits is included here as an example; in other embodiments, the operation of the MAC may be controlled in other ways.
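By way of non-limiting illustration, the decoding of the MAC-related PSW bits may be sketched in Python as follows. The function name `decode_mac_control` and the dictionary keys are hypothetical and are used only for this sketch; the bit positions (26 through 29) follow the example PSW layout given above, which other embodiments may not share.

```python
# Illustrative sketch only: names are hypothetical, bit positions follow
# the example PSW layout (bits 26/27 = mode, bit 28 = SA, bit 29 = SSP).
MODE_NAMES = {
    0b00: "default",
    0b01: "8x8",
    0b10: "16x16",
    0b11: "32x32",
}


def decode_mac_control(psw: int) -> dict:
    """Extract the MAC mode, SA bit, and SSP bit from a 32-bit PSW value."""
    macm0 = (psw >> 26) & 1  # bit 26
    macm1 = (psw >> 27) & 1  # bit 27
    return {
        "mode": MODE_NAMES[(macm1 << 1) | macm0],
        "sa": bool((psw >> 28) & 1),   # saturate accumulator value
        "ssp": bool((psw >> 29) & 1),  # shift and saturate product
    }
```

For example, a PSW value with only bit 27 and bit 28 set would decode to 16×16 mode with accumulator saturation enabled.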

The MAC of the invention receives two z-bit words, each containing a number of m-bit operands, and, depending on the operation mode determined by an instruction, performs a number of m×m multiply accumulate operations. Results of the multiply accumulate operations are placed in accumulator registers, which may be accessed individually in order to retrieve results. FIG. 3 shows that in one embodiment of the invention, the MAC receives two 32-bit words as input which each contain, depending on the mode of operation, four independent 8-bit operands (8×8 mode), two independent 16-bit operands (16×16 mode), or one 32-bit operand (32×32 mode). In both 8×8 and 16×16 modes, each operand is independently configured as signed or unsigned. In 8×8 mode, sixteen 8×8 MACs may be performed per cycle, resulting in sixteen 16-bit products accumulated into sixteen signed 20-bit accumulator registers. In 16×16 mode, four 16×16 MACs may be performed per cycle, with four 32-bit products accumulated into four signed 40-bit accumulator registers. In 32×32 mode, one 32×32 MAC is performed per cycle and one 64-bit product is accumulated into one of four signed 80-bit accumulator registers. Other embodiments of the invention may perform MAC operations on operands containing a different number of bits than listed above, for instance 64-bit operands.

Referring to FIG. 1, the MAC unit 10 receives two data words, A 38 and B 36, as input as well as an indication (for instance, in the instruction) of whether A 38 and B 36 are signed or unsigned 42, 40. The MAC unit 10 receives an activation signal 26 that also determines what mode it will operate in for the cycle, i.e., 8×8 mode 50, 16×16 mode 48, 24×24 mode 46 (in one embodiment, the MAC unit's 10 default mode is to operate as a 24×24 floating point MAC), or 32×32 mode 44.

As shown in FIG. 4, the data words A 38 and B 36 in one embodiment consist of 32 bits (or four bytes) apiece (in other embodiments, the words may consist of a larger or smaller number of bits). Depending on the mode of operation, each word may consist of one 32-bit operand 54 (i.e., DCBA and W3W2W1W0), two 16-bit operands 56 (i.e., DC, BA, W3W2, and W1W0, where D and W3 are the most significant bytes and A and W0 are the least significant bytes), or four 8-bit operands 58 (i.e., D, C, B, A, W3, W2, W1, and W0, where D and W3 are the most significant bytes and A and W0 are the least significant bytes).

As noted above, when the MAC unit operates in 8×8 mode, the results of sixteen 8×8 MAC operations are placed in sixteen 20-bit accumulator registers, or packed byte integer MAC accumulator registers (PBIMs). An example of how operands and the accumulator registers (here labeled 0 through 15) may be mapped follows:

-   PBIM15 += D*W3
-   PBIM14 += D*W2
-   PBIM13 += C*W3
-   PBIM12 += C*W2
-   PBIM11 += D*W1
-   PBIM10 += D*W0
-   PBIM9 += C*W1
-   PBIM8 += C*W0
-   PBIM7 += B*W3
-   PBIM6 += B*W2
-   PBIM5 += A*W3
-   PBIM4 += A*W2
-   PBIM3 += B*W1
-   PBIM2 += B*W0
-   PBIM1 += A*W1
-   PBIM0 += A*W0

In the preclear case (when the accumulators are set to “0”), the “+=” is replaced by “=.” The accumulator registers are logical registers and can be implemented in any way so that the registers are shared regardless of the MAC's mode of operation.
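By way of non-limiting illustration, the example 8×8 mapping above may be modeled in Python as follows. The names `PBIM_MAP`, `byte_operand`, `wrap_signed`, and `mac_8x8` are hypothetical and are introduced only for this sketch. The sketch models the arithmetic of one cycle and assumes that a sum exceeding 20 bits simply wraps; it does not model the pipeline, the overflow bits, or saturation.

```python
# Illustrative sketch only. Each entry is (byte index in word A, byte index
# in word B) for PBIM0..PBIM15, copied from the example mapping. Byte index
# 0 is the least significant byte (A or W0); 3 is the most significant.
PBIM_MAP = [
    (0, 0), (0, 1), (1, 0), (1, 1),   # PBIM0..3:  A*W0, A*W1, B*W0, B*W1
    (0, 2), (0, 3), (1, 2), (1, 3),   # PBIM4..7:  A*W2, A*W3, B*W2, B*W3
    (2, 0), (2, 1), (3, 0), (3, 1),   # PBIM8..11: C*W0, C*W1, D*W0, D*W1
    (2, 2), (2, 3), (3, 2), (3, 3),   # PBIM12..15: C*W2, C*W3, D*W2, D*W3
]


def byte_operand(word: int, index: int, signed: bool) -> int:
    """Extract one 8-bit operand, interpreting it as signed if requested."""
    value = (word >> (8 * index)) & 0xFF
    if signed and value >= 0x80:
        value -= 0x100
    return value


def wrap_signed(value: int, bits: int) -> int:
    """Wrap an integer to a signed two's-complement value of the given width."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def mac_8x8(acc, word_a, word_b, a_signed=False, b_signed=False, preclear=False):
    """Return the sixteen 20-bit accumulator values after one 8x8 MAC cycle."""
    result = []
    for k, (i, j) in enumerate(PBIM_MAP):
        product = byte_operand(word_a, i, a_signed) * byte_operand(word_b, j, b_signed)
        base = 0 if preclear else acc[k]  # preclear: "+=" becomes "="
        result.append(wrap_signed(base + product, 20))
    return result
```

With word A holding bytes D=4, C=3, B=2, A=1 and word B holding W3=8, W2=7, W1=6, W0=5, a single call starting from cleared accumulators leaves 5 in PBIM0 (A*W0) and 32 in PBIM15 (D*W3).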

In 16×16 mode, the results of four 16×16 multiply accumulate operations are placed in four 40-bit accumulator registers, or packed half-word integer MAC (“PHIM”) accumulator registers. An example of how operands and PHIM accumulator registers (here labeled 0 through 3) may be mapped follows:

-   PHIM3 += DC*W3W2
-   PHIM2 += DC*W1W0
-   PHIM1 += BA*W3W2
-   PHIM0 += BA*W1W0

In the preclear case, the “+=” is replaced by “=.” The accumulator registers are logical registers and can be implemented in any way so that the registers are shared regardless of the MAC's mode of operation.

In 32×32 mode, the result of the single 32×32 multiply accumulate operation is placed in one of four 80-bit accumulator registers, or unpacked integer MAC (UIM) accumulator registers. Which UIM register is used is determined by instruction type. An example of how the operands and UIM accumulator registers (where n is a number from 0 to 3) may be mapped follows: UIM(n) += DCBA*W3W2W1W0.

In the preclear case, the “+=” is replaced by “=.” The accumulator registers are logical registers and can be implemented in any way so that the registers are shared regardless of the MAC's mode of operation.

In one embodiment, the PBIM, PHIM, and UIM registers use the same shared 320 bits as indicated in the following table. In other embodiments, other approaches may be employed.

PBIM            PHIM            UIM
PBIM0[19:0]                     UIM0[19:0]
PBIM1[19:0]                     UIM0[39:20]
PBIM2[19:0]                     UIM0[59:40]
PBIM3[19:0]                     UIM0[79:60]
PBIM4[19:0]     PHIM0[19:0]     UIM1[19:0]
PBIM5[19:0]     PHIM0[39:20]    UIM1[39:20]
PBIM6[19:0]     PHIM1[19:0]     UIM1[59:40]
PBIM7[19:0]     PHIM1[39:20]    UIM1[79:60]
PBIM8[19:0]     PHIM2[19:0]     UIM2[19:0]
PBIM9[19:0]     PHIM2[39:20]    UIM2[39:20]
PBIM10[19:0]    PHIM3[19:0]     UIM2[59:40]
PBIM11[19:0]    PHIM3[39:20]    UIM2[79:60]
PBIM12[19:0]                    UIM3[19:0]
PBIM13[19:0]                    UIM3[39:20]
PBIM14[19:0]                    UIM3[59:40]
PBIM15[19:0]                    UIM3[79:60]
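By way of non-limiting illustration, the register sharing described above may be modeled in Python by treating the shared 320 bits as a single integer and each logical register as a bit field of it. The helper names `field`, `pbim`, `phim`, and `uim` are hypothetical. The sketch assumes that PBIM0 occupies the least significant 20 bits of the shared storage, which the table implies only in relative terms, and it returns raw (unsigned) bit patterns.

```python
# Illustrative sketch only: the 320 shared bits are modeled as one integer,
# with PBIM0 assumed to sit in the least significant 20 bits.
def field(storage: int, lsb: int, width: int) -> int:
    """Return `width` bits of `storage` starting at bit position `lsb`."""
    return (storage >> lsb) & ((1 << width) - 1)


def pbim(storage: int, n: int) -> int:
    """20-bit PBIM(n), n = 0..15."""
    return field(storage, 20 * n, 20)


def phim(storage: int, p: int) -> int:
    """40-bit PHIM(p), p = 0..3; PHIM0 overlays PBIM4 and PBIM5."""
    return field(storage, 80 + 40 * p, 40)


def uim(storage: int, m: int) -> int:
    """80-bit UIM(m), m = 0..3; UIM(m) overlays PBIM(4m)..PBIM(4m+3)."""
    return field(storage, 80 * m, 80)
```

Under this model, writing 0x12345 to PBIM4 and 0xABCDE to PBIM5 makes PHIM0 read back as 0xABCDE12345, consistent with the PHIM0[19:0] and PHIM0[39:20] rows of the table.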

In FIG. 5, when the MAC is in 16×16 mode, the input words A 38 and B 36 are divided into 16-bit segments and sent to 16×16 multiplier blocks 62 which are described in greater detail 78 below. When the 16×16 multiplier block 62 is to determine the product of BA*W1W0, the individual operands B 86, A 84, W1 82, and W0 80 are input to 8×8 multiplier blocks 12. The multiplication operations are carried out and the results are output to an adder 64, which is a 16×16 partial product assembler, and each multiplier's 12 20-bit accumulator 18. The results may be sign extended 66 as necessary before being placed in the accumulators 18.
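By way of non-limiting illustration, the assembly of a 16×16 product from four 8×8 products may be sketched in Python as follows. The function name `assemble_16x16_unsigned` is hypothetical. The sketch covers only unsigned operands and uses the ordinary arithmetic identity for combining partial products; the sign extension 66 and the actual structure of the adder 64 are not modeled and may differ.

```python
# Illustrative sketch only: unsigned operands, standard partial-product
# identity. BA and W1W0 are 16-bit values whose bytes are B:A and W1:W0.
def assemble_16x16_unsigned(ba: int, w1w0: int) -> int:
    """Form BA*W1W0 from the four 8x8 products A*W0, A*W1, B*W0, B*W1."""
    a, b = ba & 0xFF, (ba >> 8) & 0xFF
    w0, w1 = w1w0 & 0xFF, (w1w0 >> 8) & 0xFF
    low = a * w0                   # weight 2**0
    middle = a * w1 + b * w0       # weight 2**8
    high = b * w1                  # weight 2**16
    return low + (middle << 8) + (high << 16)
```

The four 8×8 products used here are the same ones that land in PBIM0 through PBIM3 in the example 8×8 mapping, which is consistent with the multipliers 12 being reused across modes.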

An indication 38, 36 of whether the operands are signed is provided (in one embodiment, as will be discussed in greater detail below, in the instruction). The accumulators 18 may add their contents to the products of the multipliers 12 unless a pre- or postclear operation has been performed 70, in which case the content of the accumulator 18 is forced to “0.” The products placed in the accumulator 18 are determined by the MAC's mode of operation 26. For instance, in 16×16 mode, the partial product from the adder 64 is passed through a multiplexer 68 and to the accumulator 18. However, in 8×8 mode, the product of the 8×8 operation is passed through the multiplexer to the accumulator 18. Overflow bits 24 (discussed in greater detail below) are sent to the appropriate register 76. The products of the accumulators 18 are then sent to an order swap multiplexer 74 and then on to the accumulator registers.

Instructions are used to initiate packed integer MAC operations. In one embodiment, the instruction also specifies whether the operands are signed or unsigned. The following instructions, for use with Cradle's DSE coprocessor, are illustrative of the type of instructions that may be used with the MAC. Other instructions may be used in other embodiments. In the following table, the index “k” of the accumulator depends on the indices “i” and “j” of the packed operands.

Instruction   Action                           Comment
PIMACUU       A[i] * B[j] + PIM[k] → PIM[k]    A, B unsigned; 8×8, 16×16 mode (PIM is the accumulator value)
PIMACSU       A[i] * B[j] + PIM[k] → PIM[k]    A signed, B unsigned; 8×8, 16×16 mode
PIMACSS       A[i] * B[j] + PIM[k] → PIM[k]    A, B signed; 8×8, 16×16 mode
PIMACPUU      A[i] * B[j] → PIM[k]             A, B unsigned; 8×8, 16×16 mode; preclear all accumulators
PIMACPSU      A[i] * B[j] → PIM[k]             A signed, B unsigned; 8×8, 16×16 mode; preclear all accumulators
PIMACPSS      A[i] * B[j] → PIM[k]             A, B signed; 8×8, 16×16 mode; preclear all accumulators
IMAC0         A * B + M[j] → M[j]              A, B unsigned; 32×32 mode; destination register UIM0 (M is the accumulator value)
IMAC1         A * B + M[j] → M[j]              A, B unsigned; 32×32 mode; destination register UIM1
IMAC2         A * B + M[j] → M[j]              A, B unsigned; 32×32 mode; destination register UIM2
IMAC3         A * B + M[j] → M[j]              A, B unsigned; 32×32 mode; destination register UIM3
IMACP0        A * B + M[j] → M[j]              A, B unsigned; 32×32 mode; destination register UIM0; preclear accumulator
IMACP1        A * B + M[j] → M[j]              A, B unsigned; 32×32 mode; destination register UIM1; preclear accumulator
IMACP2        A * B + M[j] → M[j]              A, B unsigned; 32×32 mode; destination register UIM2; preclear accumulator
IMACP3        A * B + M[j] → M[j]              A, B unsigned; 32×32 mode; destination register UIM3; preclear accumulator

In embodiments where the MAC can also operate as a 24×24 floating point MAC (“FMAC”), the instructions can have the same opcodes as the FMAC.

The accumulator registers may be accessed using move-like instructions (i.e., the registers are used as source operands in move instructions). In one embodiment, the following logical registers may be accessed for results; other embodiments may employ a different approach.

1) Registers for getting sign-extended 20-bit results for the 8×8 case
    a) PBIM0-PBIM15: a 20-bit accumulator register value can be moved sign-extended to a 32-bit dual port data memory (“DPDM”) register in the DSE.
    b) PBIMC0-PBIMC15: a 20-bit accumulator register value can be moved sign-extended to a 32-bit DPDM register in the DSE; the accumulator is cleared at the end of the read cycle.

2) Registers for getting sign-extended upper 16-bit results for the 8×8 case
    a) UPBIM0-UPBIM15: the sixteen most significant bits (“msbs”) of the 20-bit accumulator register value can be moved sign-extended to a 32-bit DPDM register in the DSE.
    b) UPBIMC0-UPBIMC15: the sixteen msbs of the 20-bit accumulator register value can be moved sign-extended to a 32-bit DPDM register in the DSE; the accumulator is cleared at the end of the read cycle.
    Note: Extracting the upper 16 bits and sign-extending the value is not integer division by sixteen with rounding towards zero. It is division by sixteen with rounding towards negative infinity.

3) Registers for getting two 16-bit results for the 8×8 case packed into a single 32-bit register
    a) PLPBIMC0-PLPBIMC7: the sixteen least significant bits (“lsbs”) of two 20-bit accumulator register values can be packed into one DPDM register; the accumulator is cleared at the end of the read cycle.
    b) PUPBIMC0-PUPBIMC7: the sixteen msbs of two 20-bit accumulator register values can be packed into one DPDM register; the accumulator is cleared at the end of the read cycle.

4) Registers for getting results for the 16×16 case
    a) PHIMT0-PHIMT3: the 32 msbs of a 40-bit accumulator register value can be moved into a 32-bit DPDM register.
    b) PHIMTC0-PHIMTC3: the 32 msbs of a 40-bit accumulator register value can be moved into a 32-bit DPDM register; the accumulator is cleared at the end of the read cycle.
    c) PHIMU0-PHIMU3: the 8 msbs of a 40-bit accumulator register value can be moved sign-extended into a 32-bit DPDM register.

5) Registers for getting results for the 32×32 case
    a) UIML0-UIML3: the 32 lsbs of an 80-bit accumulator register value can be moved into a 32-bit DPDM register.
    b) UIMU0-UIMU3: the 32 msbs of an 80-bit accumulator register value can be moved into a 32-bit DPDM register.
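By way of non-limiting illustration, two of the read behaviors listed above may be sketched in Python as follows: the sign-extended upper 16-bit read of a 20-bit accumulator, and the packing of the sixteen lsbs of two accumulators into one 32-bit word. The function names are hypothetical. Which accumulator of a pair occupies the upper half of the packed word is not stated above, so the ordering used in `pack_lower16` is an assumption.

```python
# Illustrative sketch only: function names and the packing order are
# assumptions, not part of the described embodiment.
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def read_upper16(acc20: int) -> int:
    """UPBIM-style read: sixteen msbs of a 20-bit value, sign-extended."""
    return sign_extend((acc20 & 0xFFFFF) >> 4, 16)


def pack_lower16(acc_high: int, acc_low: int) -> int:
    """PLPBIMC-style read: sixteen lsbs of two accumulators in one word."""
    return ((acc_high & 0xFFFF) << 16) | (acc_low & 0xFFFF)
```

For an accumulator holding -17, `read_upper16` yields -2, which matches division by sixteen rounded towards negative infinity, whereas rounding towards zero would give -1.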

The MAC unit described herein uses a two-stage pipeline. During the DSE execute stage, operands are clocked into the MAC pipeline. Results are available two cycles later. A register holds overflow bits from the MAC. In one embodiment, the overflow register is a read-only hardware register. The following table shows which overflow bits are visible depending on the MAC mode. Other embodiments may use a different approach.

MAC Mode   Bit     Function
00         31:0    reserved
01         31:18   16-bit PBIM(n) accumulator overflow bits
01         17      16-bit PBIM1 accumulator overflow bit
01         16      16-bit PBIM0 accumulator overflow bit
01         15:2    20-bit PBIM(n) accumulator overflow bits
01         1       20-bit PBIM1 accumulator overflow bit
01         0       20-bit PBIM0 accumulator overflow bit
10         31:8    reserved
10         7       32-bit PHIM3 accumulator overflow bit
10         6       32-bit PHIM2 accumulator overflow bit
10         5       32-bit PHIM1 accumulator overflow bit
10         4       32-bit PHIM0 accumulator overflow bit
10         3       40-bit PHIM3 accumulator overflow bit
10         2       40-bit PHIM2 accumulator overflow bit
10         1       40-bit PHIM1 accumulator overflow bit
10         0       40-bit PHIM0 accumulator overflow bit
11         31:8    reserved
11         7       64-bit UIM3 accumulator overflow bit
11         6       64-bit UIM2 accumulator overflow bit
11         5       64-bit UIM1 accumulator overflow bit
11         4       64-bit UIM0 accumulator overflow bit
11         3       80-bit UIM3 accumulator overflow bit
11         2       80-bit UIM2 accumulator overflow bit
11         1       80-bit UIM1 accumulator overflow bit
11         0       80-bit UIM0 accumulator overflow bit

Signed overflow occurs when the two inputs to the accumulator adder have the same sign but the output of the adder has the opposite sign. If A and B are the inputs to the adder and Sum is the output, the accumulator overflow bits are defined as follows:

16-bit PBIM(n) overflow = (A[15] XNOR B[15]) AND (A[15] XOR Sum[15])
20-bit PBIM(n) overflow = (A[19] XNOR B[19]) AND (A[19] XOR Sum[19])
32-bit PHIM(n) overflow = (A[31] XNOR B[31]) AND (A[31] XOR Sum[31])
40-bit PHIM(n) overflow = (A[39] XNOR B[39]) AND (A[39] XOR Sum[39])
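By way of non-limiting illustration, the signed overflow definition above may be expressed in Python as follows. The function name `signed_overflow` is hypothetical; the parameter `bits` selects the width (16, 20, 32, or 40) so that one function covers all four expressions.

```python
# Illustrative sketch only: evaluates
#   (A[msb] XNOR B[msb]) AND (A[msb] XOR Sum[msb])
# for an adder of the given width, with msb = bits - 1.
def signed_overflow(a: int, b: int, bits: int) -> bool:
    """Return True if adding a and b overflows a signed `bits`-wide adder."""
    mask = (1 << bits) - 1
    a &= mask
    b &= mask
    total = (a + b) & mask
    msb = bits - 1
    sign_a = (a >> msb) & 1
    sign_b = (b >> msb) & 1
    sign_sum = (total >> msb) & 1
    same_sign = 1 - (sign_a ^ sign_b)   # XNOR of the input sign bits
    return bool(same_sign & (sign_a ^ sign_sum))
```

For a 20-bit accumulator, adding 1 to 0x7FFFF (the largest positive value) reports overflow, while adding 0xFFFFF (that is, -1) to 0x7FFFF does not, because the inputs differ in sign.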

When both operands are signed, or only operand A is signed, overflow is calculated for MAC operations in one of two ways depending on the embodiment. The calculations are as follows:

-   i) Overflow from bit n−1 = CarryOut(n−1) XOR CarryOut(n−2) of adder, or
-   ii) Overflow = ˜(SignProduct XOR SignAccumulator Operand) AND (SignAdder XOR SignProduct)

When both operands in an 8×8 or 16×16 operation are unsigned, the value of the 16- or 32-bit overflow bit is undefined. The accumulator overflow bits for unsigned addition are as follows:

64-bit UIM(n) overflow = “UIM(n) bit 63 carry-out”
80-bit UIM(n) overflow = “UIM(n) bit 79 carry-out”

Overflow bits are sticky and remain set unless cleared explicitly, for instance, when the corresponding accumulator is cleared by accessing a postclear register or when a pre-clear instruction is executed.

In FIG. 6, in 16×16 mode, accumulator values 90 may be saturated after the accumulator values are provided. In one embodiment, the SA bit 34 in the PSW is set to indicate saturation should occur if there is overflow from bit 31 94. If these conditions are met 112, and the overflow is in the positive direction, then 0x7fffffff 102 is sent to the register 110. If the overflow is in the negative direction, 0xff80000000 104 is sent to the register.
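By way of non-limiting illustration, the effect of accumulator saturation in 16×16 mode may be sketched in Python as follows. The function name `saturate_to_32` is hypothetical. The sketch makes a simplifying assumption: it treats saturation as clamping a signed accumulator value to the 32-bit signed range, rather than modeling the bit 31 overflow detection of the circuit, and it returns the resulting 40-bit register pattern.

```python
# Illustrative sketch only: saturation is modeled as clamping to the signed
# 32-bit range; the circuit's overflow detection logic is not modeled.
INT32_MAX = 0x7FFFFFFF
INT32_MIN = -0x80000000
MASK40 = 0xFFFFFFFFFF


def saturate_to_32(acc_value: int, sa: bool = True) -> int:
    """Return the 40-bit register pattern for a signed accumulator value."""
    if sa:
        acc_value = max(INT32_MIN, min(INT32_MAX, acc_value))
    return acc_value & MASK40
```

Under this model, a value one above the 32-bit maximum is stored as 0x007fffffff, and a value one below the 32-bit minimum is stored as 0xff80000000, the sign-extended 40-bit form of the most negative 32-bit number.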

In FIG. 7 a, in 16×16 mode a bit (in one embodiment, the SSP bit in the PSW) 32 may be set to shift left by one and saturate the product if necessary before it is sent to the register. When the bit 32 is set, the product from the multiplier block 62 is shifted left by 1 120 and saturated 118 where necessary. When 0x8000 is multiplied by 0x8000, the result is 0x40000000. When 0x40000000 is shifted left by 1 (that is, multiplied by 2), the result is 0x80000000 and the sign changes. If this occurs, the result must be saturated 118 to the greatest positive number, 0x7FFFFFFF. The results can be sign extended 66, depending on the operands 36, 38.

In FIG. 7 b, an alternative method 140 of saturation uses two comparators 142, 144 to explicitly check the input operands. Saturation only occurs if both input operands 152, 154 are 0x8000. A check of whether both inputs are 0x8000 146 will determine if saturation 118 is required.
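By way of non-limiting illustration, the two shift-and-saturate approaches of FIGS. 7 a and 7 b may be compared in Python as follows. The function names are hypothetical. Both functions take signed 16-bit operands as ordinary Python integers (so 0x8000 is represented as -0x8000) and return a signed 32-bit result; sign extension 66 and unsigned operands are not modeled.

```python
# Illustrative sketch only: operands are signed 16-bit values held in
# Python integers; 0x8000 as a signed 16-bit value is -0x8000.
INT32_MAX = 0x7FFFFFFF
MOST_NEGATIVE_16 = -0x8000


def shift_saturate_by_result(x: int, y: int) -> int:
    """FIG. 7 a style: shift the product left by 1, saturate if it overflows."""
    shifted = (x * y) << 1
    return INT32_MAX if shifted > INT32_MAX else shifted


def shift_saturate_by_compare(x: int, y: int) -> int:
    """FIG. 7 b style: saturate only when both operands equal 0x8000."""
    if x == MOST_NEGATIVE_16 and y == MOST_NEGATIVE_16:
        return INT32_MAX
    return (x * y) << 1
```

The two functions agree for all signed 16-bit inputs because the doubled product exceeds the 32-bit signed range only when both operands are 0x8000; every other operand pair produces a doubled product between -0x7FFF0000 and 0x7FFF0000.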

FIG. 8 shows the combined saturation circuit 114; the aspects of thecombined circuit 114 have been discussed in FIGS. 6 and 7 a.

1. A method for performing parallel integer multiply accumulate operations on packed data comprising: a) inputting to a MAC two z-bit words containing a first number of m-bit operands; b) multiplying the m-bit operands in a second number of m×m operations to obtain a third number of n-bit products; c) adding the content of at least one of the j-bit accumulators to the n-bit product of the corresponding m×m operation to obtain a fourth number of n-bit results; d) outputting the fourth number of n-bit results into a fifth number of j-bit accumulators, wherein the fourth number of n-bit results are output in the same cycle; and e) accessing at least one accumulator register to retrieve results, wherein the results may be individually retrieved and the specific accumulator cleared after retrieval.
2. The method of claim 1 further comprising specifying an m×m mode for each cycle.
3. The method of claim 2 wherein, depending on the mode specified, the input and output are one of the following: a) each of the two z-bit words is 32 bits and has four independent 8-bit operands and sixteen 16-bit products are accumulated into sixteen signed 20-bit accumulator registers, wherein each operand is independently configured as signed or unsigned; b) each of the two z-bit words is 32 bits and has two independent 16-bit operands and four 32-bit products are accumulated into four signed 40-bit accumulator registers, wherein each operand is independently configured as signed or unsigned; or c) each of the two z-bit words is 32 bits long and has one 32-bit operand and one 64-bit product is accumulated into a signed 80-bit accumulator register.
4. The method of claim 1 further comprising adding together results of at least two of the second number of m×m operations to assemble a partial product.
5. The method of claim 1 further comprising packing two results into a single accumulator register.
6. The method of claim 1 further comprising storing overflow bits in a register.
7. The method of claim 1 further comprising globally initializing all accumulators.
8. The method of claim 1 further comprising specifying an m×m mode for each cycle.
9. The method of claim 1 further comprising shifting results.
10. The method of claim 1 further comprising saturating results.
11. The method of claim 1 further comprising truncating results.
12. An apparatus for performing parallel integer multiply accumulate operations on packed data comprising: a) a first number of multipliers for multiplying two z-bit words inputted into the apparatus, the two z-bit words having a second number of m-bit operands; b) a third number of j-bit accumulators for storing results of operations performed by the apparatus, each of the third number of j-bit accumulators coupled to at least one of the first number of multipliers; c) a fourth number of adders for combining the result of at least one multiplier and a value stored in one of the third number of j-bit accumulators, each of the fourth number of adders coupled to the at least one of the first number of multipliers; and d) a fifth number of n-bit accumulator registers for accessing results of operations performed by the apparatus, each of the fifth number of n-bit accumulator registers coupled to at least one of the third number of j-bit accumulators, wherein the results stored in the n-bit accumulator registers may be individually retrieved and the accumulator corresponding to the accessed register cleared after retrieval.
13. The apparatus of claim 12 further comprising means for receiving an input command specifying a mode of operation for the apparatus, wherein the receiving means is coupled to the first number of multipliers.
14. The apparatus of claim 12 further comprising means for receiving two o-bit words wherein the receiving means is coupled to the first number of multipliers.
15. The apparatus of claim 13 wherein, depending on the mode specified, the input and output are one of the following: a) each of the two z-bit words is 32 bits and has four independent 8-bit operands and sixteen 16-bit products are accumulated into sixteen signed 20-bit accumulator registers, wherein each operand is independently configured as signed or unsigned; b) each of the two z-bit words is 32 bits and has two independent 16-bit operands and four 32-bit products are accumulated into four signed 40-bit accumulator registers, wherein each operand is independently configured as signed or unsigned; or c) each of the two z-bit words is 32 bits and has one 32-bit operand and one 64-bit product is accumulated into a signed 80-bit accumulator register.
16. The apparatus of claim 12 further comprising adding means for adding together results of at least two of the first number of multipliers to assemble a partial product, wherein the adding means is coupled to the first number of multipliers.
17. The apparatus of claim 12 further comprising means for globally initializing each of the third number of accumulators.
18. The apparatus of claim 12 further comprising means for shifting results of operations.
19. The apparatus of claim 12 further comprising means for saturating results of the operations.
20. The apparatus of claim 12 further comprising means for storing overflow bits in a register.
21. The apparatus of claim 12 further comprising means for truncating results of operations.
22. The apparatus of claim 12 wherein, depending on the input command, the apparatus operates in one of the following modes: a) 8×8 mode; b) 16×16 mode; or c) 32×32 mode.
23. A method for performing parallel integer multiply accumulate operations on packed data comprising: a) receiving a command indicating a mode of operation; b) receiving two z-bit words containing a first number of m-bit operands; c) multiplying the m-bit operands in a second number of m×m operations as required by the mode of operation; d) accumulating the result of the multiplication operations with a value stored in an accumulator corresponding to the multiplier; e) accessing the result of the accumulation operation, wherein an individual result may be obtained and the accumulator corresponding to an accessed register is cleared after retrieval.
24. The method of claim 23 wherein, depending on the mode specified, the input and output are one of the following: a) each of the two z-bit words is 32 bits and has four independent 8-bit operands and sixteen 16-bit products are accumulated into sixteen signed 20-bit accumulator registers, wherein each operand is independently configured as signed or unsigned; b) each of the two z-bit words is 32 bits and has two independent 16-bit operands and four 32-bit products are accumulated into four signed 40-bit accumulator registers, wherein each operand is independently configured as signed or unsigned; or c) each of the two z-bit words is 32 bits and has one 32-bit operand and one 64-bit product is accumulated into a signed 80-bit accumulator register.
25. The method of claim 23 further comprising adding together results of at least two of the second number of m×m operations to assemble a partial product.
26. The method of claim 23 further comprising packing two results into a single accumulator register.
27. The method of claim 23 further comprising storing overflow bits in a register.
28. The method of claim 23 further comprising globally initializing all accumulators.
29. The method of claim 23 further comprising shifting results.
30. The method of claim 23 further comprising saturating results.
31. The method of claim 23 further comprising truncating results.